Random Projection with Filtering for Nearly Duplicate Search
Abstract
High-dimensional nearest neighbor search is a fundamental problem with applications in many domains. Although many hashing-based approaches have been proposed for approximate nearest neighbor search in high-dimensional space, one main drawback is that they often return many false positives that must be filtered out in a post-processing step. We propose a novel method to address this limitation. The key idea is to introduce a filtering procedure within the search algorithm, based on compressed sensing theory, that effectively removes the false positive answers. We first obtain a sparse representation for each data point using a landmark-based approach, after which we solve the nearly duplicate search problem, in which the difference between the query and its nearest neighbors forms a sparse vector lying in a small ℓ_p ball, where p ≤ 1. Our empirical study on real-world datasets demonstrates the effectiveness of the proposed approach compared to state-of-the-art hashing methods.
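The general idea of filtering candidates in a randomly projected space can be illustrated with a minimal sketch. This is not the authors' algorithm: the landmark-based sparse representation and the compressed sensing machinery are omitted, and the function names, the Gaussian projection, and the `slack` parameter are assumptions made purely for illustration.

```python
import numpy as np


def make_projection(dim, m, seed=0):
    """Return an m x dim Gaussian random projection matrix (illustrative choice)."""
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(m) preserves squared norms in expectation.
    return rng.standard_normal((m, dim)) / np.sqrt(m)


def filter_candidates(A, projected_data, query, radius, slack=1.5):
    """Keep indices whose projected distance to the query is at most slack * radius.

    `slack` is a hypothetical tolerance: projected distances only approximate
    the original ones, so the cheap filter is made slightly permissive.
    """
    projected_query = A @ query
    dists = np.linalg.norm(projected_data - projected_query, axis=1)
    return np.flatnonzero(dists <= slack * radius)


def near_duplicate_search(data, A, projected_data, query, radius):
    """Filter in the low-dimensional space, then verify survivors exactly."""
    candidates = filter_candidates(A, projected_data, query, radius)
    exact = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[exact <= radius]


# Usage: a query that differs from one stored point in only two coordinates.
rng = np.random.default_rng(1)
data = rng.standard_normal((200, 128))
query = data[17].copy()
query[[3, 40]] += 0.05          # sparse, small perturbation
A = make_projection(dim=128, m=64)
projected_data = data @ A.T     # computed once, reused for every query
hits = near_duplicate_search(data, A, projected_data, query, radius=0.5)
```

In this sketch the projected-space filter discards most points cheaply, and the exact distance check removes any remaining false positives.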
Similar resources
A Family of Selective Partial Update Affine Projection Adaptive Filtering Algorithms
In this paper we present a general formalism for the establishment of the family of selective partial update affine projection algorithms (SPU-APA). The SPU-APA, the SPU regularized APA (SPU-R-APA), the SPU partial rank algorithm (SPU-PRA), the SPU binormalized data reusing least mean squares (SPU-BNDR-LMS), and the SPU normalized LMS with orthogonal correction factors (SPU-NLMS-OCF) algorithms...
Enhancing Keyword Search in Relational Databases Using Nearly Duplicate Records
The importance of supporting keyword searches on relations has been widely recognized. Different from the existing keyword search techniques on relations, this paper focuses on nearly duplicate records in relational databases due to abbreviation and typos. As a result, processing keyword searches with duplicate records involves many unique challenges. In this paper we discuss the motivation and...
Duplicate Quora Questions Detection
Quora is a platform to ask questions and connect with people who contribute unique insights and quality answers. In this paper, we focus mainly on duplicate question detection. The main idea is to first vectorize the questions and extract features, then train a model and predict using machine learning techniques based on the question vectors and features built previously. We implement two approaches to ...
Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)
We present the first provably sublinear time hashing algorithm for approximate Maximum Inner Product Search (MIPS). Searching with (un-normalized) inner product as the underlying similarity measure is a known difficult problem and finding hashing schemes for MIPS was considered hard. While the existing Locality Sensitive Hashing (LSH) framework is insufficient for solving MIPS, in this paper we...
Application of Single-Frequency Time-Space Filtering Technique for Seismic Ground Roll and Random Noise Attenuation
Time-frequency filtering is an acceptable technique for attenuating noise in 2-D (time-space) and 3-D (time-space-space) reflection seismic data. The common approach for this purpose is transforming each seismic signal from 1-D time domain to a 2-D time-frequency domain and then denoising the signal by a designed filter and finally transforming back the filtered signal to original time domain. ...
Publication date: 2012